Biostat 203B Homework 1

Due Jan 26, 2024 @ 11:59PM

Author

Jiyin (Jenny) Zhang, UID: 606331859

Display machine information for reproducibility:

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.2    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.8    xfun_0.41        
[13] digest_0.6.33     rlang_1.1.2       evaluate_0.23    

Q1. Git/GitHub

No handwritten homework reports are accepted for this course. We work with Git and GitHub. Efficient and abundant use of Git, e.g., frequent and well-documented commits, is an important criterion for grading your homework.

  1. Apply for the Student Developer Pack at GitHub using your UCLA email. You’ll get a GitHub Pro account for free (unlimited public and private repositories).

  2. Create a private repository biostat-203b-2024-winter and add Hua-Zhou and TA team (Tomoki-Okuno for Lec 1; jonathanhori and jasenzhang1 for Lec 80) as your collaborators with write permission.

  3. Top directories of the repository should be hw1, hw2, … Maintain two branches main and develop. The develop branch will be your main playground, the place where you develop solutions (code) to homework problems and write up reports. The main branch will be your presentation area. Submit your homework files (Quarto file qmd, html file converted by Quarto, all code and extra data sets to reproduce results) in the main branch.

  4. After each homework due date, the course reader and instructor will check out your main branch for grading. Tag each of your homework submissions with tag names hw1, hw2, … Tagging time will be used as your submission time. That means if you tag your hw1 submission after the deadline, penalty points will be deducted for late submission.

  5. After this course, you can make this repository public and use it to demonstrate your skill sets on the job market.
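
The branch-and-tag workflow described in the list above can be sketched in a throwaway repository. This is only an illustration under stated assumptions: the temporary directory, the file contents, the commit messages, and the identity "Example Student" are placeholders, not the actual course repository.

```bash
# Hypothetical sandbox repository; every name below is a placeholder.
set -euo pipefail
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the initial branch main on any Git version
git config user.name "Example Student"
git config user.email "student@example.com"

# First commit on main.
mkdir hw1
echo "# Homework 1" > hw1/hw1.qmd
git add hw1
git commit -q -m "Add hw1 skeleton"

# Do the actual work on the develop branch.
git checkout -q -b develop
echo "Answer to Q1" >> hw1/hw1.qmd
git commit -q -am "Answer Q1"

# Present on main: merge develop, then tag the submission.
git checkout -q main
git merge -q --no-ff -m "Merge develop for hw1 submission" develop
git tag -a hw1 -m "hw1 submission"      # annotated tag records the tagging time

git tag --list
git log --oneline --decorate -n 1
```

In a real repository with a remote (assumed here to be called origin), the tag would still need to be pushed, for example with git push origin main hw1.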

Answer: My GitHub repository is at https://github.com/Zhangjiyin2000/biostat-203b-2024-winter

Q2. Data ethics training

This exercise (and later in this course) uses the MIMIC-IV data v2.2, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at https://mimic.mit.edu/docs/gettingstarted/ to (1) complete the CITI Data or Specimens Only Research course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here. You must complete Q2 before working on the remaining questions. (Hint: The CITI training takes a few hours and the PhysioNet credentialing takes a couple days; do not leave it to the last minute.)

Answer:

I completed the CITI training. Here is the link to my completion report. Here is the link to my completion certificate.

I also obtained the PhysioNet credential for using the MIMIC-IV data. Here is the screenshot of my PhysioNet credential.

[Screenshot: PhysioNet credential]

Q3. Linux Shell Commands

  1. Make the MIMIC v2.2 data available at location ~/mimic.
ls -l ~/mimic/

Refer to the documentation https://physionet.org/content/mimiciv/2.2/ for details of data files. Please do not put these data files into Git; they are big. Do not copy them into your directory. Do not decompress the gz data files. These create unnecessarily big files and are not big-data-friendly practices. Read from the data folder ~/mimic directly in the following exercises.

Use Bash commands to answer the following questions.

Answer: I created a symbolic link mimic to my MIMIC data folder. Here is the output of ls -l ~/mimic/:

ls -l ~/mimic/
total 48
-rw-rw-r--@  1 zhangjiyin  staff  13332 Jan  5  2023 CHANGELOG.txt
-rw-rw-r--@  1 zhangjiyin  staff   2518 Jan  5  2023 LICENSE.txt
-rw-rw-r--@  1 zhangjiyin  staff   2884 Jan  6  2023 SHA256SUMS.txt
drwxr-xr-x@ 24 zhangjiyin  staff    768 Jan  5 23:41 hosp
drwxr-xr-x@ 11 zhangjiyin  staff    352 Jan  5 23:41 icu
lrwxr-xr-x   1 zhangjiyin  staff     61 Jan 24 22:46 mimic-iv-2.2 -> /Users/zhangjiyin/Desktop/ucla/23-24/winter/203B/mimic-iv-2.2

Here is how I created the symbolic link:

# ln -s /Users/zhangjiyin/Desktop/ucla/23-24/winter/203B/mimic-iv-2.2 ./mimic
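
As a hedged illustration of how ln -s treats its second argument, the sketch below uses a temporary sandbox (all paths are placeholders, not the real data folder). When the link name does not exist yet, ln creates the link itself. When it is an existing directory, ln creates a link named after the target inside that directory, which appears to match the mimic-iv-2.2 entry in the listing above.

```bash
# Sandbox only; the directory names are stand-ins for the real paths.
set -euo pipefail
sandbox=$(mktemp -d)
mkdir "$sandbox/mimic-iv-2.2"               # stand-in for the downloaded data folder
touch "$sandbox/mimic-iv-2.2/LICENSE.txt"

# Case 1: the link name does not exist yet, so the link itself is created.
ln -s "$sandbox/mimic-iv-2.2" "$sandbox/mimic"
readlink "$sandbox/mimic"                   # prints the link target

# Case 2: the link name is an existing directory, so ln places a link
# named after the target inside that directory.
mkdir "$sandbox/existing"
ln -s "$sandbox/mimic-iv-2.2" "$sandbox/existing"
ls -l "$sandbox/existing"
```
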

  2. Display the contents in the folders hosp and icu using Bash command ls -l. Why are these data files distributed as .csv.gz files instead of .csv (comma separated values) files? Read the page https://mimic.mit.edu/docs/iv/ to understand what’s in each folder.

Answer:

Here is the output of ls -l ~/mimic/hosp/:

ls -l ~/mimic/hosp/
total 8859752
-rw-rw-r--@ 1 zhangjiyin  staff    15516088 Jan  5  2023 admissions.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff      427468 Jan  5  2023 d_hcpcs.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff      859438 Jan  5  2023 d_icd_diagnoses.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff      578517 Jan  5  2023 d_icd_procedures.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff       12900 Jan  5  2023 d_labitems.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff    25070720 Jan  5  2023 diagnoses_icd.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff     7426955 Jan  5  2023 drgcodes.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff   508524623 Jan  5  2023 emar.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff   471096030 Jan  5  2023 emar_detail.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff     1767138 Jan  5  2023 hcpcsevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff  1939088924 Jan  5  2023 labevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff    96698496 Jan  5  2023 microbiologyevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff    36124944 Jan  5  2023 omr.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff     2312631 Jan  5  2023 patients.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff   398753125 Jan  5  2023 pharmacy.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff   498505135 Jan  5  2023 poe.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff    25477219 Jan  5  2023 poe_detail.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff   458817415 Jan  5  2023 prescriptions.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff     6027067 Jan  5  2023 procedures_icd.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff      122507 Jan  5  2023 provider.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff     6781247 Jan  5  2023 services.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff    36158338 Jan  5  2023 transfers.csv.gz

Here is the output of ls -l ~/mimic/icu/:

ls -l ~/mimic/icu/
total 6155968
-rw-rw-r--@ 1 zhangjiyin  staff       35893 Jan  5  2023 caregiver.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff  2467761053 Jan  5  2023 chartevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff       57476 Jan  5  2023 d_items.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff    45721062 Jan  5  2023 datetimeevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff     2614571 Jan  5  2023 icustays.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff   251962313 Jan  5  2023 ingredientevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff   324218488 Jan  5  2023 inputevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff    38747895 Jan  5  2023 outputevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff    20717852 Jan  5  2023 procedureevents.csv.gz

Gzip compression shrinks these large text files substantially (labevents.csv.gz alone is about 1.9 GB compressed), which saves storage and makes the files faster to download. The compression is lossless, so the decompressed data is identical to the original. The trade-off is that every read must decompress the file, which costs CPU time.
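
A minimal sketch of both claims (smaller size, lossless round trip) on a toy file. The file name and its contents are invented for illustration and the compression ratio of real MIMIC files will differ.

```bash
# Toy data only; toy.csv is a made-up stand-in for a MIMIC table.
set -euo pipefail
work=$(mktemp -d)
{
  echo "subject_id,admission_type"
  for i in $(seq 1 5000); do
    echo "$((10000 + i % 50)),EW EMER."
  done
} > "$work/toy.csv"

# -c writes the compressed stream to stdout, so the original file is kept.
gzip -c "$work/toy.csv" > "$work/toy.csv.gz"

raw_bytes=$(wc -c < "$work/toy.csv" | tr -d ' ')
gz_bytes=$(wc -c < "$work/toy.csv.gz" | tr -d ' ')
echo "uncompressed: $raw_bytes bytes, compressed: $gz_bytes bytes"

# Lossless: the decompressed stream is byte-for-byte identical to the original.
gzip -cd "$work/toy.csv.gz" | cmp - "$work/toy.csv" && echo "round trip is identical"
```
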

Hosp: The Hosp module provides all data acquired from the hospital wide electronic health record. Information covered includes patient and admission information, laboratory measurements, microbiology, medication administration, and billed diagnoses.

ICU: The ICU module contains information collected from the clinical information system used within the ICU. Documented data includes intravenous administrations, ventilator settings, and other charted items.

ED: The ED module contains data for emergency department patients collected while they are in the ED. Information includes reason for admission, triage assessment, vital signs, and medicine reconciliation. The subject_id and hadm_id identifiers allow MIMIC-IV-ED to be linked to other MIMIC-IV modules.

CXR: The CXR module provides lookup tables linking patient identifiers with MIMIC-CXR study_id and dicom_id, allowing analysis of patient chest x-rays to be linked with the clinical data from other MIMIC-IV modules.

Note: (NOT PUBLICLY AVAILABLE): The Note module contains deidentified free-text clinical notes for hospitalized patients.

  3. Briefly describe what Bash commands zcat, zless, zmore, and zgrep do.

Answer:

zcat: zcat writes the decompressed contents of one or more compressed files to standard output without creating an uncompressed copy on disk. It is equivalent to gzip -cd.

zless: zless is used to view the contents of a compressed file one screen at a time. It is equivalent to gzip -cd | less. less is an improved version of more with additional features. It allows both forward and backward navigation through the file. You can use the arrow keys, Page Up, Page Down, and other keys for navigation. Press ‘q’ to exit. less supports searching, highlighting, and can display line numbers.

zmore: zmore is used to view the contents of a compressed file one screen at a time. It is equivalent to gzip -cd | more. You can press the spacebar to advance to the next screen, and press the Enter key to move one line at a time.

zgrep: zgrep is used to search through one or more compressed files for a string of characters that matches a specified pattern. It is equivalent to gzip -cd | grep.
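
The non-interactive commands can be tried on a small compressed file. This is a sketch on invented data (toy.csv.gz and its two columns are placeholders). zless and zmore are interactive pagers, so only their gzip -cd equivalent is shown, piped to head instead of a pager.

```bash
# Toy data only; the file and its columns are made up for illustration.
set -euo pipefail
work=$(mktemp -d)
printf 'subject_id,insurance\n1,Medicare\n2,Other\n3,Medicare\n' | gzip > "$work/toy.csv.gz"

# zcat: decompress to stdout. Reading from stdin with "<" also works with the
# macOS zcat, which otherwise looks for a .Z suffix.
zcat < "$work/toy.csv.gz"

# zgrep: search the compressed file; -c counts the matching lines.
zgrep -c 'Medicare' "$work/toy.csv.gz"

# Non-interactive stand-in for zless/zmore: decompress and show the first lines.
gzip -cd "$work/toy.csv.gz" | head -n 2
```
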

  4. (Looping in Bash) What’s the output of the following bash script?
for datafile in ~/mimic/hosp/{a,l,pa}*.gz
do
  ls -l $datafile
done

Display the number of lines in each data file using a similar loop. (Hint: combine linux commands zcat < and wc -l.)
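
One possible shape for such a loop, sketched on a temporary directory of tiny compressed files rather than on ~/mimic/hosp. The directory and file contents are placeholders. With the real data, the glob would be ~/mimic/hosp/{a,l,pa}*.gz and the large files would take a while to scan.

```bash
# Sandbox stand-in for ~/mimic/hosp; file contents are invented.
set -euo pipefail
data_dir=$(mktemp -d)
printf 'a\nb\nc\n' | gzip > "$data_dir/admissions.csv.gz"
printf 'a\nb\n'    | gzip > "$data_dir/labevents.csv.gz"
printf 'a\n'       | gzip > "$data_dir/patients.csv.gz"
printf 'x\n'       | gzip > "$data_dir/omr.csv.gz"      # not matched by the pattern

# Brace expansion yields three globs: a*.gz, l*.gz, and pa*.gz.
for datafile in "$data_dir"/{a,l,pa}*.gz
do
  # Quote the variable so paths containing spaces stay intact.
  n=$(zcat < "$datafile" | wc -l | tr -d ' ')
  echo "$(basename "$datafile"): $n lines"
done
```
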

  5. Display the first few lines of admissions.csv.gz. How many rows are in this data file? How many unique patients (identified by subject_id) are in this data file? Do they match the number of patients listed in the patients.csv.gz file? (Hint: combine Linux commands zcat <, head/tail, awk, sort, uniq, wc, and so on.)

  6. What are the possible values taken by each of the variable admission_type, admission_location, insurance, and ethnicity? Also report the count for each unique value of these variables. (Hint: combine Linux commands zcat, head/tail, awk, uniq -c, wc, and so on; skip the header line.)

  7. To compress, or not to compress. That’s the question. Let’s focus on the big data file labevents.csv.gz. Compare compressed gz file size to the uncompressed file size. Compare the run times of zcat < ~/mimic/hosp/labevents.csv.gz | wc -l versus wc -l labevents.csv. Discuss the trade-off between storage and speed for big data files. (Hint: gzip -dk < FILENAME.gz > ./FILENAME. Remember to delete the large labevents.csv file after the exercise.)
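
The counting questions in this list (rows, unique subject_id values, and counts per category) follow one pipeline pattern, sketched here on a toy file. The file name, the three-column layout, and the values are assumptions made for illustration and do not reflect the real admissions.csv.gz. Splitting with awk -F, is also a simplification: it would miscount fields if a quoted value contained a comma.

```bash
# Toy stand-in for admissions.csv.gz; the column layout is an assumption.
set -euo pipefail
work=$(mktemp -d)
f="$work/toy_admissions.csv.gz"
printf '%s\n' \
  'subject_id,hadm_id,admission_type' \
  '100,1,URGENT' \
  '100,2,ELECTIVE' \
  '101,3,URGENT' \
  '102,4,URGENT' | gzip > "$f"

# Number of data rows: skip the header with tail -n +2.
n_rows=$(zcat < "$f" | tail -n +2 | wc -l | tr -d ' ')

# Number of unique values in column 1 (subject_id).
n_patients=$(zcat < "$f" | tail -n +2 | awk -F, '{ print $1 }' | sort -u | wc -l | tr -d ' ')

echo "rows: $n_rows, unique subject_id: $n_patients"

# Count per value of column 3, most frequent first.
zcat < "$f" | tail -n +2 | awk -F, '{ print $3 }' | sort | uniq -c | sort -nr
```
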

Q5. More fun with Linux

Try the following commands in Bash and interpret the results: cal, cal 2024, cal 9 1752 (anything unusual?), date, hostname, arch, uname -a, uptime, who am i, who, w, id, last | head, echo {con,pre}{sent,fer}{s,ed}, time sleep 5, history | tail.

Answer: Here is the output of the commands:

cal
    January 2024      
Su Mo Tu We Th Fr Sa  
    1  2  3  4  5  6  
 7  8  9 10 11 12 13  
14 15 16 17 18 19 20  
21 22 23 24 25 26 27  
28 29 30 31           
                      

cal: display the calendar of the current month.

cal 2024
                            2024
      January               February               March          
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  
    1  2  3  4  5  6               1  2  3                  1  2  
 7  8  9 10 11 12 13   4  5  6  7  8  9 10   3  4  5  6  7  8  9  
14 15 16 17 18 19 20  11 12 13 14 15 16 17  10 11 12 13 14 15 16  
21 22 23 24 25 26 27  18 19 20 21 22 23 24  17 18 19 20 21 22 23  
28 29 30 31           25 26 27 28 29        24 25 26 27 28 29 30  
                                            31                    

       April                  May                   June          
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  
    1  2  3  4  5  6            1  2  3  4                     1  
 7  8  9 10 11 12 13   5  6  7  8  9 10 11   2  3  4  5  6  7  8  
14 15 16 17 18 19 20  12 13 14 15 16 17 18   9 10 11 12 13 14 15  
21 22 23 24 25 26 27  19 20 21 22 23 24 25  16 17 18 19 20 21 22  
28 29 30              26 27 28 29 30 31     23 24 25 26 27 28 29  
                                            30                    

        July                 August              September        
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  
    1  2  3  4  5  6               1  2  3   1  2  3  4  5  6  7  
 7  8  9 10 11 12 13   4  5  6  7  8  9 10   8  9 10 11 12 13 14  
14 15 16 17 18 19 20  11 12 13 14 15 16 17  15 16 17 18 19 20 21  
21 22 23 24 25 26 27  18 19 20 21 22 23 24  22 23 24 25 26 27 28  
28 29 30 31           25 26 27 28 29 30 31  29 30                 
                                                                  

      October               November              December        
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  
       1  2  3  4  5                  1  2   1  2  3  4  5  6  7  
 6  7  8  9 10 11 12   3  4  5  6  7  8  9   8  9 10 11 12 13 14  
13 14 15 16 17 18 19  10 11 12 13 14 15 16  15 16 17 18 19 20 21  
20 21 22 23 24 25 26  17 18 19 20 21 22 23  22 23 24 25 26 27 28  
27 28 29 30 31        24 25 26 27 28 29 30  29 30 31              
                                                                  

cal 2024: display the calendar of the year 2024.

cal 9 1752
   September 1752     
Su Mo Tu We Th Fr Sa  
       1  2 14 15 16  
17 18 19 20 21 22 23  
24 25 26 27 28 29 30  
                      
                      
                      

cal 9 1752: display the calendar for September 1752. This month is unusual because Great Britain and its colonies switched from the Julian calendar to the Gregorian calendar in September 1752. By then the Julian calendar was 11 days behind the Gregorian calendar, so September 2 was followed directly by September 14 and the 11 days from September 3 to September 13 were skipped.

date
Thu Jan 25 10:41:21 PST 2024

date: display the current date and time.

hostname
zhangjiyindeAir.lan

hostname: display the name of the host.

arch
arm64

arch: display the machine hardware name.

uname -a
Darwin zhangjiyindeAir.lan 21.6.0 Darwin Kernel Version 21.6.0: Thu Mar  9 20:10:19 PST 2023; root:xnu-8020.240.18.700.8~1/RELEASE_ARM64_T8101 arm64

uname -a: display the system information.

uptime
10:41  up 2 days,  9:36, 2 users, load averages: 1.82 1.53 1.44

uptime: display the current time, how long the system has been running, how many users are currently logged on, and the system load averages for the past 1, 5, and 15 minutes.

who am i
zhangjiy tty??    Jan 25 10:41 

who am i: display the user name, terminal, and login time of the current session.

who
zhangjiyin console  Jan 23 01:06 
zhangjiyin ttys000  Jan 24 00:52 

who: display the users who are currently logged in.

# w 

w: display the users who are currently logged in and what they are doing.

id
uid=501(zhangjiyin) gid=20(staff) groups=20(staff),12(everyone),61(localaccounts),79(_appserverusr),80(admin),81(_appserveradm),98(_lpadmin),33(_appstore),100(_lpoperator),204(_developer),250(_analyticsusers),395(com.apple.access_ftp),398(com.apple.access_screensharing),399(com.apple.access_ssh),400(com.apple.access_remote_ae),701(com.apple.sharepoint.group.1)

id: display the user and group information for the current user.

last | head
zhangjiyin  ttys000                   Wed Jan 24 00:52   still logged in
zhangjiyin  console                   Tue Jan 23 01:06   still logged in
reboot    ~                         Tue Jan 23 01:05 
zhangjiyin  console                   Mon Jan 22 14:51 - 01:04  (10:13)
reboot    ~                         Mon Jan 22 14:42 
shutdown  ~                         Mon Jan 22 14:42 
zhangjiyin  ttys000                   Thu Jan 18 10:16 - 10:16  (00:00)
zhangjiyin  ttys000                   Thu Jan 18 10:11 - 10:11  (00:00)
zhangjiyin  ttys000                   Thu Jan 18 10:10 - 10:10  (00:00)
zhangjiyin  ttys000                   Thu Jan 18 09:10 - 09:10  (00:00)

last | head: display the ten most recent entries of the login history, including reboot and shutdown records.

echo {con,pre}{sent,fer}{s,ed}
consents consented confers confered presents presented prefers prefered

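echo {con,pre}{sent,fer}{s,ed}: Bash brace expansion generates every combination of the three groups (2 × 2 × 2 = 8 words), and echo prints them: “consents”, “consented”, “confers”, “confered”, “presents”, “presented”, “prefers”, “prefered”. The expansion is purely textual, so misspellings such as “confered” and “prefered” appear as well.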
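
A short sketch of how brace expansion combines groups, assuming Bash (brace expansion and arrays are not available in plain POSIX sh). The array name words and the hw prefix are illustrative only.

```bash
# Requires Bash. Each brace group multiplies the number of generated words.
words=( {con,pre}{sent,fer}{s,ed} )
echo "${#words[@]} words"          # 2 * 2 * 2 combinations
printf '%s\n' "${words[@]}"        # leftmost group varies slowest

# Brace expansion also supports numeric sequences.
echo hw{1..3}
```
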

time sleep 5

real    0m5.005s
user    0m0.000s
sys     0m0.001s

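time sleep 5: run sleep 5 and report how long it took. The real (wall-clock) time is about 5 seconds, while the user and sys CPU times are near zero because sleep waits without computing.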
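
To make the three numbers easier to compare, the sketch below captures them for a waiting command and for a busy loop. It assumes Bash, whose time keyword honors the TIMEFORMAT variable. The one-second sleep and the loop size are arbitrary choices, and the exact timings vary by machine.

```bash
# Requires Bash: TIMEFORMAT controls what the time keyword prints.
TIMEFORMAT='%R %U %S'              # real, user, and sys time in seconds

# sleep waits without computing: real time is about 1 s, CPU time is near zero.
read -r real user sys < <( { time sleep 1; } 2>&1 )
echo "sleep 1   -> real=${real}s user=${user}s sys=${sys}s"

# A busy loop does the opposite: most of its real time is user CPU time.
read -r real2 user2 sys2 < <( { time for ((i = 0; i < 200000; i++)); do :; done; } 2>&1 )
echo "busy loop -> real=${real2}s user=${user2}s sys=${sys2}s"
```
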

history | tail

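history | tail: display the last 10 commands in the history. The output above is empty, presumably because the command ran in a non-interactive shell, where no history is recorded.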

Q6. Book

  1. Git clone the repository https://github.com/christophergandrud/Rep-Res-Book for the book Reproducible Research with R and RStudio to your local machine.

  2. Open the project by clicking rep-res-3rd-edition.Rproj and compile the book by clicking Build Book in the Build panel of RStudio. (Hint: I was able to build git_book and epub_book but not pdf_book.)

The point of this exercise is (1) to get the book for free and (2) to see an example of how a complicated project such as a book can be organized in a reproducible way.

For grading purpose, include a screenshot of Section 4.1.5 of the book here.

Answer:

I was also able to build git_book and epub_book but not pdf_book.

Here is the screenshot of Section 4.1.5 of the git_book.

[Screenshot: Section 4.1.5 of the git_book]

Here is the screenshot of Section 4.1.5 of the epub_book.

[Screenshot: Section 4.1.5 of the epub_book]